{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a7029ad2",
   "metadata": {},
   "source": [
    "# Week 1: Explore the BBC News archive\n",
    "\n",
    "Welcome! In this assignment you will be working with a variation of the [BBC News Classification Dataset](https://www.kaggle.com/c/learn-ai-bbc/overview), which contains 2225 examples of news articles with their respective categories.\n",
    "\n",
    "\n",
    "#### TIPS FOR SUCCESSFUL GRADING OF YOUR ASSIGNMENT:\n",
    "- All cells are frozen except for the ones where you need to submit your solutions or when explicitly mentioned you can interact with it.\n",
    "\n",
    "- You can add new cells to experiment but these will be omitted by the grader, so don't rely on newly created cells to host your solution code, use the provided places for this.\n",
    "- You can add the comment # grade-up-to-here in any graded cell to signal the grader that it must only evaluate up to that point. This is helpful if you want to check if you are on the right track even if you are not done with the whole assignment. Be sure to remember to delete the comment afterwards!\n",
    "- Avoid using global variables unless you absolutely have to. The grader tests your code in an isolated environment without running all cells from the top. As a result, global variables may be unavailable when scoring your submission. Global variables that are meant to be used will be defined in UPPERCASE.\n",
    "- To submit your notebook, save it and then click on the blue submit button at the beginning of the page.\n",
    "\n",
    "\n",
    "Let's get started!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "18e757f2",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "deletable": false,
    "editable": false,
    "id": "zrZevCPJ92HG",
    "outputId": "4ba10d10-1433-42fc-dc8e-83b297705cfe",
    "tags": [
     "graded"
    ]
   },
   "outputs": [],
   "source": [
    "import csv\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import tensorflow as tf"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "48119af8-9469-420c-acb6-d57f03e1f20a",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "import unittests"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f2ee7f84",
   "metadata": {},
   "source": [
    "Begin by looking at the structure of the csv that contains the data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ffae3497",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "with open(\"./data/bbc-text.csv\", 'r') as csvfile:\n",
    "    print(f\"First line (header) looks like this:\\n\\n{csvfile.readline()}\")\n",
    "    print(f\"Each data point looks like this:\\n\\n{csvfile.readline()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1624e420",
   "metadata": {},
   "source": [
    "As you can see, each data point is composed of the category of the news article followed by a comma and then the actual text of the article."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "28e9d216",
   "metadata": {},
   "source": [
    "## Exercise 1: parse_data_from_file\n",
    "\n",
    "The csv is a very common format to store data and you will probably encounter it many times so it is good to be comfortable with it. Your first exercise will be to read the data from the raw csv file so you can analyze it and built models around it. To do so, complete the `parse_data_from_file` function below. \n",
    "\n",
    "Since this format is so common there are a lot of ways to deal with this files using python, both using the standard library or third party libraries such as [pandas](https://pandas.pydata.org/). Because of this the implementation details are entirely up to you, **the only requirement is that your function returns the `sentences` and `labels` as regular python lists**.\n",
    "\n",
    "**Hints**:\n",
    "- Remember the file contains headers so take this into consideration.\n",
    "\n",
    "\n",
    "- If you are unfamiliar with  libraries such as pandas or numpy and you prefer to use python's standard library, take a look at [`csv.reader`](https://docs.python.org/3/library/csv.html#csv.reader), which lets you iterate over the lines of a csv file. \n",
    "- You can use the [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function from the pandas library.\n",
    "- You can use the [`loadtxt`](https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html) function from the numpy library.\n",
    "- If you use any of the two latter approaches remember you still need to convert the `sentences` and `labels` to regular python lists, so take a look at the docs to see how it can be done.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fd40ca26",
   "metadata": {
    "deletable": false,
    "tags": [
     "graded"
    ]
   },
   "outputs": [],
   "source": [
    "# GRADED FUNCTION: parse_data_from_file\n",
    "\n",
    "def parse_data_from_file(filename):\n",
    "    \"\"\"\n",
    "    Extracts sentences and labels from a CSV file\n",
    "    \n",
    "    Args:\n",
    "        filename (str): path to the CSV file\n",
    "    \n",
    "    Returns:\n",
    "        (list[str], list[str]): tuple containing lists of sentences and labels\n",
    "    \"\"\"\n",
    "    sentences = []\n",
    "    labels = []\n",
    "\n",
    "    ### START CODE HERE ###\n",
    "\n",
    "\t\n",
    "    ### END CODE HERE ###\n",
    "\n",
    "    return sentences, labels"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e2c9a86e",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "# Get sentences and labels as python lists\n",
    "sentences, labels = parse_data_from_file(\"./data/bbc-text.csv\")\n",
    "\n",
    "print(f\"There are {len(sentences)} sentences in the dataset.\\n\")\n",
    "print(f\"First sentence has {len(sentences[0].split())} words.\\n\")\n",
    "print(f\"There are {len(labels)} labels in the dataset.\\n\")\n",
    "print(f\"The first 5 labels are {labels[:5]}\\n\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1f2bf934",
   "metadata": {},
   "source": [
    "***Expected Output:***\n",
    "```\n",
    "There are 2225 sentences in the dataset.\n",
    "\n",
    "First sentence has 737 words.\n",
    "\n",
    "There are 2225 labels in the dataset.\n",
    "\n",
    "The first 5 labels are ['tech', 'business', 'sport', 'sport', 'entertainment']\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f6ad47d2",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "# Test your code!\n",
    "unittests.test_parse_data_from_file(parse_data_from_file)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2db58b95",
   "metadata": {},
   "source": [
    "**An important note:**\n",
    "\n",
    "At this point you typically would convert your data into a `tf.data.Dataset` (alternatively you could have used [tf.data.experimental.CsvDataset](https://www.tensorflow.org/api_docs/python/tf/data/experimental/CsvDataset) to do this directly but since this is an experimental feature it is better to avoid when possible) but for this assignment you will keep working with the data as regular python lists. \n",
    "\n",
    "The reason behind this is that by using a `tf.data.Dataset` some parts of this assignment will be much more difficult (in particular the next exercise), because when dealing with tensors you need to take additional considerations that you don't need to when dealing with lists and since this is the first assignment of the course, it is best to keep things simple. During next week's assignment you will get to see how this process looks like but for now carry on with the data in this format and worry not since TensorFlow is still compatible with these data formats!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e9df59b3",
   "metadata": {},
   "source": [
    "## Exercise 2: standardize_func\n",
    "\n",
    "One important step when working with text data is to standardize it so it is easier to extract information out of it. For instance, you probably want to convert it all to lower-case (so the same word doesn't have different representations such as \"hello\" and \"Hello\") and to remove the [stopwords](https://en.wikipedia.org/wiki/Stop_word) from it. These are the most common words in the language and they rarely provide useful information for the classification process. The next cell provides a list of common stopwords which you can use in the exercise:\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "40742129",
   "metadata": {
    "deletable": false,
    "editable": false,
    "tags": [
     "graded"
    ]
   },
   "outputs": [],
   "source": [
    "# List of stopwords\n",
    "STOPWORDS = [\"a\", \"about\", \"above\", \"after\", \"again\", \"against\", \"all\", \"am\", \"an\", \"and\", \"any\", \"are\", \"as\", \"at\", \"be\", \"because\", \"been\", \"before\", \"being\", \"below\", \"between\", \"both\", \"but\", \"by\", \"could\", \"did\", \"do\", \"does\", \"doing\", \"down\", \"during\", \"each\", \"few\", \"for\", \"from\", \"further\", \"had\", \"has\", \"have\", \"having\", \"he\", \"he'd\", \"he'll\", \"he's\", \"her\", \"here\", \"here's\", \"hers\", \"herself\", \"him\", \"himself\", \"his\", \"how\", \"how's\", \"i\", \"i'd\", \"i'll\", \"i'm\", \"i've\", \"if\", \"in\", \"into\", \"is\", \"it\", \"it's\", \"its\", \"itself\", \"let's\", \"me\", \"more\", \"most\", \"my\", \"myself\", \"nor\", \"of\", \"on\", \"once\", \"only\", \"or\", \"other\", \"ought\", \"our\", \"ours\", \"ourselves\", \"out\", \"over\", \"own\", \"same\", \"she\", \"she'd\", \"she'll\", \"she's\", \"should\", \"so\", \"some\", \"such\", \"than\", \"that\", \"that's\", \"the\", \"their\", \"theirs\", \"them\", \"themselves\", \"then\", \"there\", \"there's\", \"these\", \"they\", \"they'd\", \"they'll\", \"they're\", \"they've\", \"this\", \"those\", \"through\", \"to\", \"too\", \"under\", \"until\", \"up\", \"very\", \"was\", \"we\", \"we'd\", \"we'll\", \"we're\", \"we've\", \"were\", \"what\", \"what's\", \"when\", \"when's\", \"where\", \"where's\", \"which\", \"while\", \"who\", \"who's\", \"whom\", \"why\", \"why's\", \"with\", \"would\", \"you\", \"you'd\", \"you'll\", \"you're\", \"you've\", \"your\", \"yours\", \"yourself\", \"yourselves\" ]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7d81ac27",
   "metadata": {},
   "source": [
    "To achieve this, complete the `standardize_func` below. This function should receive a string and return another string that excludes all of the stopwords provided from it, as well as converting it to lower-case. \n",
    "\n",
    "**Hints:**\n",
    "- You only need to account for whitespace as the separation mechanism between words in the sentence.\n",
    "\n",
    "- The list of stopwords is already provided for you as a global variable you can safely use.\n",
    "\n",
    "- Check out the [lower](https://docs.python.org/3/library/stdtypes.html#str.lower) method for python strings.\n",
    "\n",
    "- The returned sentence should not include extra whitespace so the string \"hello&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;again&nbsp;&nbsp;&nbsp;FRIENDS\" should be standardized to \"hello friends\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "33637e35",
   "metadata": {
    "deletable": false,
    "tags": [
     "graded"
    ]
   },
   "outputs": [],
   "source": [
    "# GRADED FUNCTION: standardize_func\n",
    "\n",
    "def standardize_func(sentence):\n",
    "    \"\"\"Standardizes sentences by converting to lower-case and removing stopwords.\n",
    "\n",
    "    Args:\n",
    "        sentence (str): Original sentence.\n",
    "\n",
    "    Returns:\n",
    "        str: Standardized sentence in lower-case and without stopwords.\n",
    "    \"\"\"\n",
    "    \n",
    "    ### START CODE HERE ###\n",
    "\n",
    "\t\n",
    "    ### END CODE HERE ###\n",
    "\n",
    "    return sentence"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7ea8a832",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "test_sentence = \"Hello! We're just about to see this function in action =)\"\n",
    "standardized_sentence = standardize_func(test_sentence)\n",
    "print(f\"Original sentence is:\\n{test_sentence}\\n\\nAfter standardizing:\\n{standardized_sentence}\")\n",
    "\n",
    "standard_sentences = [standardize_func(sentence) for sentence in sentences]\n",
    "\n",
    "print(\"\\n\\n--- Apply the standardization to the dataset ---\\n\")\n",
    "print(f\"There are {len(standard_sentences)} sentences in the dataset.\\n\")\n",
    "print(f\"First sentence has {len(sentences[0].split())} words originally.\\n\")\n",
    "print(f\"First sentence has {len(standard_sentences[0].split())} words (after removing stopwords).\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dc6541a8",
   "metadata": {},
   "source": [
    "***Expected Output:***\n",
    "```\n",
    "Original sentence is:\n",
    "Hello! We're just about to see this function in action =)\n",
    "\n",
    "After standardizing:\n",
    "hello! just see function action =)\n",
    "\n",
    "\n",
    "--- Apply the standardization to the dataset ---\n",
    "\n",
    "There are 2225 sentences in the dataset.\n",
    "\n",
    "First sentence has 737 words originally.\n",
    "\n",
    "First sentence has 436 words (after removing stopwords).\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a446bcc2-03b3-4c3a-8f20-35cf5f3ee91c",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "# Test your code!\n",
    "unittests.test_standardize_func(standardize_func)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "42be5c19",
   "metadata": {},
   "source": [
    "With the dataset standardized you could go ahead and convert it to a `tf.data.Dataset`, which you will NOT be doing for this assignment. However if you are curious, this can be achieved like this:\n",
    "\n",
    "```python\n",
    "dataset = tf.data.Dataset.from_tensor_slices((standard_sentences, labels))\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "93500124",
   "metadata": {},
   "source": [
    "## Exercise 3: fit_vectorizer\n",
    "\n",
    "Now that your data is standardized, it is time to vectorize the sentences of the dataset. For this complete the `fit_vectorizer` below. \n",
    "\n",
    "This function should receive the list of sentences as input and return a [`tf.keras.layers.TextVectorization`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization) that has been adapted to those sentences."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a27dedb7",
   "metadata": {
    "deletable": false,
    "tags": [
     "graded"
    ]
   },
   "outputs": [],
   "source": [
    "# GRADED FUNCTION: fit_vectorizer\n",
    "\n",
    "def fit_vectorizer(sentences):\n",
    "    \"\"\"\n",
    "    Instantiates the TextVectorization layer and adapts it to the sentences.\n",
    "    \n",
    "    Args:\n",
    "        sentences (list[str]): lower-cased sentences without stopwords\n",
    "    \n",
    "    Returns:\n",
    "        tf.keras.layers.TextVectorization: an instance of the TextVectorization layer adapted to the texts.\n",
    "    \"\"\"\n",
    "\n",
    "    ### START CODE HERE ###\n",
    "\n",
    "    # Instantiate the Tokenizer class by passing in the oov_token argument\n",
    "    vectorizer = None\n",
    "\n",
    "    # Adapt to the sentences\n",
    "    \n",
    "\n",
    "    ### END CODE HERE ###\n",
    "\n",
    "    return vectorizer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "edfc1352-1a2c-4e70-b656-eba2a1e23f0d",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "# Create the vectorizer adapted to the standardized sentences\n",
    "vectorizer = fit_vectorizer(standard_sentences)\n",
    "\n",
    "# Get the vocabulary\n",
    "vocabulary = vectorizer.get_vocabulary()\n",
    "\n",
    "print(f\"Vocabulary contains {len(vocabulary)} words\\n\")\n",
    "print(\"[UNK] token included in vocabulary\" if \"[UNK]\" in vocabulary else \"[UNK] token NOT included in vocabulary\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a909d650",
   "metadata": {},
   "source": [
    "***Expected Output:***\n",
    "```\n",
    "Vocabulary contains 33088 words\n",
    "\n",
    "[UNK] token included in vocabulary\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "24289640-08f7-4b90-9ba5-936cba61e31a",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "# Test your code!\n",
    "unittests.test_fit_vectorizer(fit_vectorizer)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4265c298",
   "metadata": {},
   "source": [
    "Next, you can use the adapted vectorizer to vectorize the sentences in your dataset. Notice that by default `tf.keras.layers.TextVectorization` pads the sequences so all of them have the same length (typically the length of the longest sentence will be used if no truncation is defined), this is important because neural networks expect the inputs to have the same size."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "679f64b3-5c56-41e0-8f35-33e254debf10",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "# Vectorize and pad sentences\n",
    "padded_sequences = vectorizer(standard_sentences)\n",
    "\n",
    "# Show the output\n",
    "print(f\"First padded sequence looks like this: \\n\\n{padded_sequences[0]}\\n\")\n",
    "print(f\"Tensor of all sequences has shape: {padded_sequences.shape}\\n\")\n",
    "print(f\"This means there are {padded_sequences.shape[0]} sequences in total and each one has a size of {padded_sequences.shape[1]}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ff3ef524",
   "metadata": {},
   "source": [
    "Notice that now the variable refers to `sequences` rather than `sentences`. This is because all your text data is now encoded as a sequence of integers."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2558ecac-f7c3-4417-bef7-8d926a225b01",
   "metadata": {},
   "source": [
    "## Exercise 4: fit_label_encoder\n",
    "\n",
    "\n",
    "With the sentences already vectorized it is time to encode the labels so they can also be fed into a neural network. For this complete the `fit_label_encoder` below. \n",
    "\n",
    "This function should receive the list of labels as input and return a [`tf.keras.layers.StringLookup`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/StringLookup) that has been adapted to those sentences. In theory you could also use `tf.keras.layers.TextVectorization` layer here but it provides a lot of extra functionality that is not required so it ends up being overkill. `tf.keras.layers.StringLookup` is able to perform the job just fine and it is much simpler.\n",
    "\n",
    "**Hints:**\n",
    "- Since all of the texts have their corresponding labels you need to ensure that the vocabulary does not include the out-of-vocabulary (OOV) token since that is not a valid label."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aa48558a",
   "metadata": {
    "deletable": false,
    "tags": [
     "graded"
    ]
   },
   "outputs": [],
   "source": [
    "# GRADED FUNCTION: fit_label_encoder\n",
    "\n",
    "def fit_label_encoder(labels):\n",
    "    \"\"\"\n",
    "    Tokenizes the labels\n",
    "    \n",
    "    Args:\n",
    "        labels (list[str]): labels to tokenize\n",
    "    \n",
    "    Returns:\n",
    "        tf.keras.layers.StringLookup: adapted encoder for labels\n",
    "    \"\"\"\n",
    "    ### START CODE HERE ###\n",
    "    \n",
    "    # Instantiate the StringLookup layer. Remember that you don't want any OOV tokens!\n",
    "    label_encoder = None\n",
    "    \n",
    "    # Adapt the StringLookup layer to the labels\n",
    "\t\n",
    "\t\n",
    "    ### END CODE HERE ###\n",
    "    \n",
    "    return label_encoder"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dd71a405",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "# Create the encoder adapted to the labels\n",
    "label_encoder = fit_label_encoder(labels)\n",
    "\n",
    "# Get the vocabulary\n",
    "vocabulary = label_encoder.get_vocabulary()\n",
    "\n",
    "# Encode labels\n",
    "label_sequences = label_encoder(labels)\n",
    "\n",
    "print(f\"Vocabulary of labels looks like this: {vocabulary}\\n\")\n",
    "print(f\"First ten labels: {labels[:10]}\\n\")\n",
    "print(f\"First ten label sequences: {label_sequences[:10]}\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2e187514",
   "metadata": {},
   "source": [
    "***Expected Output:***\n",
    "```\n",
    "Vocabulary of labels looks like this: ['sport', 'business', 'politics', 'tech', 'entertainment']\n",
    "\n",
    "First ten labels: ['tech', 'business', 'sport', 'sport', 'entertainment', 'politics', 'politics', 'sport', 'sport', 'entertainment']\n",
    "\n",
    "First ten label sequences: [3 1 0 0 4 2 2 0 0 4]\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "03383492",
   "metadata": {},
   "source": [
    "You should see that each encoded label corresponds to the index of its corresponding label in the vocabulary!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "71e8ad91-049b-47cf-85f8-51350635a81e",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "# Test your code!\n",
    "unittests.test_fit_label_encoder(fit_label_encoder)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dcb859f2",
   "metadata": {
    "id": "6rITvNKqT-51"
   },
   "source": [
    "Great job! Now you have successfully performed all the necessary steps to train a neural network capable of processing text. This is all for now but in next week's assignment you will train a model capable of classifying the texts in this same dataset!\n",
    "\n",
    "\n",
    "**Congratulations on finishing this week's assignment!**\n",
    "\n",
    "You have successfully implemented functions to process various text data processing ranging from pre-processing, reading from raw files and tokenizing text.\n",
    "\n",
    "**Keep it up!**"
   ]
  }
 ],
 "metadata": {
  "dlai_version": "1.2.0",
  "grader_version": "1",
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
